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Abstract 


We  propose  a  method  for  approximating  integrated  likelihoods,  or  posterior  normalizing 
constants,  in  finite  mixture  models,  for  which  analytic  approximations  such  as  the  Laplace 
method  are  invalid.  Integrated  likelihoods  are  key  components  of  Bayes  factors  and  of 
the  posterior  model  probabilities  used  in  Bayesian  model  averaging.  The  method  starts 
by  formulating  the  model  in  terms  of  the  unobserved  group  memberships,  Z,  and  making 
these,  rather  than  the  model  parameters,  the  variables  of  integration.  The  integral  is  then 
evaluated  using  importance  sampling  over  the  Z.  The  tricky  part  is  choosing  the  importance 
sampling  function,  and  we  study  the  use  of  mixtures  as  importance  sampling  functions.  We 
propose  two  forms  of  this:  defensive  mixture  importance  sampling  (DMIS),  and  Z-distance 
importance  sampling.  We  choose  the  parameters  of  the  mixture  adaptively,  and  we  show 
how  this  can  be  done  so  as  to  approximately  minimize  the  variance  of  the  approximation  to 
the  integral. 

The  resulting  method  is  easy  to  implement,  involving  only  simple  multinomial  sampling, 
it  is  almost  as  easy  for  complex  mixture  models  as  for  simple  ones,  and  it  extends  easily 
to  more  complicated  mixture  models.  The  simulated  values  on  which  it  is  based  are  inde¬ 
pendent,  and  so  it  avoids  problems  of  convergence  due  to  dependence  of  successive  iterates. 
We  also  propose  a  way  of  dealing  with  the  label-switching  problem.  The  method  provides 
a  standard  error,  and  so  is  to  some  extent  self-monitoring.  In  simulations  based  on  a  prob¬ 
lem  in  molecular  biology,  the  methods  performed  well.  The  approach  can  be  applied  more 
generally  to  models  that  are  simple  when  written  in  a  complete-data  form,  i.e.  that  are 
amenable  to  the  EM  algorithm,  such  as  models  for  missing  data,  for  censoring  or  truncated 
data,  random  effects  or  variance  components. 

Key  Words:  Allelotype  Bayes  factor;  Bayesian  model  averaging;  Beta-binomial  distribution; 
Defensive  mixture  importance  sampling;  Gibbs  sampling;  Label-switching;  Markov  chain 
Monte  Carlo;  Multimodality. 
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1  Introduction 


The  integrated  likelihood,  sometimes  also  called  the  marginal  likelihood,  plays  an  essential 
role  in  Bayesian  inference  and  testing,  as  it  is  the  central  component  of  the  Bayes  factor  for 
comparing  two  models,  and  of  the  posterior  model  probability  of  one  model  conditional  on 
data  and  on  a  set  of  candidate  models.  It  also  plays  a  role  in  Bayesian  estimation,  as  the 
normalizing  constant  for  the  posterior  distribution.  The  integrated  likelihood  of  a  model  is 

I  =  pr(x)  =  J  /(x|T)p(r)dr,  (1) 

where  x  denotes  the  observed  data,  /(x|r,  M)  is  the  likelihood  function  for  the  parameter 
r  under  the  model,  and  p(r)  is  the  density  (or  probability  mass  function)  for  the  prior 
distribution  of  r  given  the  model. 

Since  the  integrated  likelihood  often  is  not  analytically  tractable,  a  body  of  literature  on 
the  use  of  numerical  methods  for  its  calculation  has  developed.  These  are  reviewed  in  Evans 
and  Swartz  (1995)  and  include  methods  based  on  quadrature  rules,  Laplace’s  method,  im¬ 
portance  sampling  and  Markov  Chain  Monte  Carlo  (MCMC).  Combinations  of  MCMC  with 
importance  sampling  and  the  Laplace  method  are  considered  by  Rozenkranz  and  Raftery 
(1994),  Raftery  (1996b)  and  Lewis  and  Raftery  (1997).  The  Bayesian  Information  Criterion 
(BIC)  can  be  used  as  an  asymptotic  approximation  to  the  log  Bayes  factor  (Schwarz,  1978; 
Kass  and  Wasserman,  1995). 

In  finite  mixture  models,  however,  none  of  these  methods  is  fully  satisfactory.  Two 
features  of  mixture  models  make  many  current  methods  for  approximating  the  integrated 
likelihood  problematic.  The  first  is  that  the  model  is  not  “regular”  for  testing  and  model 
selection  purposes.  In  regular  models,  the  log-likelihood  becomes  approximately  elliptically 
contoured  when  there  are  enough  data,  even  when  the  true  parameter  values  correspond 
to  a  lower-dimensional  submodel  that  one  is  trying  to  test.  In  this  standard  situation, 
for  example,  the  likelihood-ratio  test  statistic  has  an  approximate  asymptotic  chi-squared 
distribution  with  degrees  of  freedom  equal  to  the  difference  in  the  number  of  parameters.  This 
does  not  hold  in  finite  mixture  models  whenever  one  estimates  a  model  with  G  components 
but  the  true  number  of  components  is  smaller,  so  that  the  true  parameter  values  lie  on  the 
edge  of  the  parameter  space  (Lindsay  1995). 

A  second  feature  is  the  “label-switching”  problem,  namely  that  the  likelihood  is  invariant 
to  relabelling  of  the  mixture  components,  and  so  has  Gl  modes  of  the  same  height.  Addi¬ 
tional  local  modes  are  often  present  (Lindsay,  1995;  Titterington,  Smith  and  Makov,  1985), 
especially  when  more  than  G  components  are  fitted  (Atwood  et  al,  1992). 
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The  Laplace  method  (e.g.  Tierney  and  Kadane  1986)  provides  an  analytic  approxima¬ 
tion  to  the  integrated  likelihood  based  on  the  assumption  that  the  posterior  distribution  is 
approximately  elliptically  contoured  (e.g.  Raftery  1996a),  and  when  this  assumption  holds 
it  can  provide  approximations  of  remarkable  quality  (e.g.  Tierney  and  Kadane  1986;  Grun- 
wald,  Guttorp  and  Raftery  1993;  Lewis  and  Raftery  1997).  However,  for  mixture  models 
this  assumption  fails  when  the  model  being  fit  has  G  components  and  the  actual  number 
of  components  is  smaller  (Lindsay  1995),  which  is  a  situation  of  great  interest  for  model 
comparison  and  testing.  Thus  the  Laplace  method  does  not  work  in  this  situation. 

The  original  justification  of  the  BIG  was  in  terms  of  the  Laplace  method,  and  it  provides 
a  good  approximation  to  the  integrated  likelihood  in  regular  models  for  a  unit  information 
prior  on  the  parameters  (Kass  and  Wasserman  1995;  Raftery  1995).  This  justification  does 
not  hold  for  mixture  models,  although  BIG  does  provide  a  consistent  estimate  of  the  number 
of  components  in  the  mixture  (Leroux  1992;  Keribin  1998),  it  leads  to  density  estimates  that 
are  consistent  for  the  true  density  (Roeder  and  Wasserman  1997),  and  it  has  given  good 
results  in  a  range  of  applications  (e.g.  Dasgupta  and  Raftery  1998;  Fraley  and  Raftery  1998, 
2000). 

Quadrature  methods  can  be  used  but  they  begin  to  break  down  for  problems  with  more 
than  4  parameters  (Evans  and  Swartz  1995),  and  the  number  of  parameters  in  mixture 
models  quickly  surpasses  this  as  the  number  of  groups  and/or  the  dimension  of  the  data 
increase.  When  testing  or  comparing  mixture  models,  one  is  typically  considering  at  least 
some  models  that  have  substantial  numbers  of  parameters. 

Markov  chain  Monte  Carlo  (MCMC)  can  be  used  to  estimate  mixture  models,  and  asso¬ 
ciated  methods  can  be  used  to  approximate  integrated  likelihoods  (e.g.  Chib  1995,  Raftery 
1996b).  However,  in  addition  to  the  usual  problems  with  MCMC  methods  (dependent  sam¬ 
ples,  convergence  issues,  complexity  of  programming  and  implementation),  in  mixture  models 
they  can  easily  fall  foul  of  the  label-switching  problem  (Celeux  1997;  Stephens  1997,  2000). 
For  example,  Neal  (1998)  pointed  out  that  Chib’s  (1995)  results  for  a  mixture  model  were 
in  error  for  this  reason.  Assessing  the  accuracy  of  the  estimated  integrated  likelihoods  is  not 
trivial  with  MCMC  because  of  the  dependence  between  successive  samples. 

Reversible  jump  MCMC  methods  can  be  used  to  estimate  Bayes  factors  and  posterior 
model  probabilities  for  mixture  models  (Richardson  and  Green  1997),  but  they  are  also 
prone  to  label-switching  problems,  and  convergence  can  be  even  more  of  a  problem  than  for 
regular  MCMC.  For  example,  in  the  rejoinder  to  the  discussion  of  their  paper,  Richardson 
and  Green  (1997)  mentioned  that  diagnostics  indicated  that  their  method  may  not  have 
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converged  even  after  500,000  iterations  for  the  one-dimensional  mixture  model  of  the  galaxy 
data  they  analyzed.  The  difficulty  of  implementing  reversible  jump  MCMC  efficiently  for 
mixture  models  seems  to  increase  with  the  dimension  of  the  data. 

Our  goal  in  this  paper  is  to  propose  importance  sampling  methods  for  integrated  likeli¬ 
hoods  in  mixture  models  that  are  easy  to  implement  and  that  avoid  the  difficulties  we  have 
been  discussing.  The  method  consists  first  of  reformulating  the  model  in  terms  of  the  unob¬ 
served  group  memberships,  Z,  as  is  done  for  example  for  estimation  via  the  EM  algorithm, 
and  then  estimating  the  integrated  likelihood  using  importance  sampling  on  the  Z.  Analytic 
or  quadrature  methods  are  used  for  integration  over  r;  this  is  a  “complete-data”  problem 
and  as  such  is  often  straightforward. 

The  success  of  any  importance  sampling  method  depends  critically  on  the  importance 
sampling  function,  and  here  two  methods  for  creating  this  function  are  proposed:  (1)  de¬ 
fensive  mixture  importance  sampling  (Hesterberg  1995),  and  (2)  sampling  via  perturbation 
of  an  initial  grouping  that  has  high  posterior  probability  (the  “Z-distance  method”).  The 
second  method  appears  to  be  new.  In  both  of  these  methods,  the  importance  sampling 
function  is  itself  a  mixture. 

Our  goals  for  integrated  likelihood  estimators  are  accurate  estimates  of  the  integral, 
realistic  estimates  of  precision,  practicality  (in  terms  of  programming  time  and  computational 
time),  and  conceptual  simplicity.  Simulation  studies  of  easy  to  modestly  difficult  integration 
problems  suggest  that  our  strategy  achieves  all  of  these  goals. 

The  resulting  strategy  is  easy  to  implement,  involving  only  simple  multinomial  sampling. 
The  samples  on  which  it  is  based  are  independent,  so  there  are  no  problems  of  noncon¬ 
vergence  due  to  dependence  between  samples,  as  with  MCMC.  Our  approach  includes  a 
way  of  dealing  with  the  label-switching  problem  that  can  plague  MCMC,  and  also  provides 
estimated  standard  errors,  and  so  is  to  an  extent  self-monitoring. 

Our  approach  does  not  become  much  more  complicated  as  the  complexity  of  the  mixture 
model  increases.  Thus  it  can  be  used  with  almost  equal  ease  for  high-dimensional  mixture 
models  as  for  univariate  ones,  and  for  complex  mixture  models  as  for  simple  ones.  It  extends 
easily  to  most  finite  mixture  models.  It  seems  to  provide  quite  good  approximations  to  the 
integrated  likelihood,  and  also  reliable  estimates  of  the  associated  standard  error. 

In  Section  2  we  review  mixture  models  and  present  our  importance  sampling  based  esti¬ 
mators.  In  Section  3  we  show  how  these  estimators  are  applied  in  the  context  of  multimodal¬ 
ity  due  to  the  label-switching  problem.  In  Section  4  we  present  simulation  results  motivated 
by  an  application  in  molecular  genetics,  and  in  Section  5  we  discuss  advantages,  limitations 
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and  directions  for  future  research. 


2  Mixture  Models  and  Importance  Sampling  Methods 


2.1  Finite  Mixture  Models 


Let  X  =  (xi,...,x„)  be  a  realization  of  a  random  sample  =  (Xi,...,X„)  from  a 

G-component  mixture  distribution.  The  corresponding  likelihood  is 

n  G  n 

=  n  (2) 

i=l j=l  i=l 

where  the  tt^’s  are  mixing  proportions  that  sum  to  1,  vr  =  (tti,  . . . ,  ttg),  and  the  Oj's  are 
component-specific  parameter  vectors  with  9  =  {9[, . . .  .O'q)' .  Each  observation,  Xj,  arises 
from  one  of  the  G  component  densities,  fj,j  =  1, . . .  ,G,  but  the  group  memberships  are 
unknown.  The  parameter  tt^  is  the  unknown  probability  of  an  observation  arising  from  fj. 

To  obtain  the  integrated  likelihood,  or  marginal  probability  of  the  data,  the  joint  distri¬ 
bution  of  X  and  r  =  {9,  vr)  is  integrated  with  respect  to  the  unknown  parameters: 

/n 

Ylf{-Xi\9,Tr)p{9,Tr)dT,  (3) 

i=i 

where  p{9,  vr)  is  the  prior  density  for  {9,  vr)  =  r.  Analytic  integration  of  (3)  is  virtually  always 
impossible. 

The  component  membership  for  Xj  may  be  thought  of  as  an  unobserved  random  variable. 
When  the  component  membership  is  known,  the  likelihood  takes  a  simpler  form.  Let  z,  = 
{zii, . . . ,  Zio)'  be  the  vector  that  indicates  component  membership  for  the  observation  such 
that  Zij  =  1  if  Xj  is  from  component  j  and  0  otherwise.  The  nx  G  matrix  Z„  =  {z[, . . . ,  z^}' 
gives  the  component  membership  for  the  entire  sample,  and  we  let  Zi  and  be  the  random 
variables  corresponding  to  z,  and  Z„,  respectively.  Then 


=  /  n  /  fx\z{^i\2i  =  Zi,9,7^)fz{zi\9,7^)dzip{9,7^)d 

JT  Jzi 
„  „  n  G 


(4) 

(5) 


Note  that  we  have  assumed  that  fz{zi\9,T^)  =  fz{zi\T^)  =  In  the  rest  of  this 

article,  we  also  make  the  (usually  natural)  assumption  that  9  and  vr  are  indepedent  a  priori, 
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so  that  p{0,Tr)  =  p{0)p{Tr)  (where  it  is  understood  that  the  refer  to  different  functions, 
depending  on  the  argument). 

This  formulation  greatly  simplifies  part  of  the  problem,  since  the  inner  integral  of  (5) 
(an  integration  with  respect  to  r  =  (0,71-))  often  can  be  evaluated  analytically,  or  at  least 
closely  approximated  via  the  Laplace  method  or  a  similar  approach,  and  may  also  be  more 
amenable  to  numerical  integration  via  quadrature.  The  problem  takes  the  following  general 
form: 

I  =  j  L{ii.\Z)p{Z)dZ.  (6) 

Here,  L(x|Z)  =  L(x|ZH  =  ZJ  =  niLi  /(x*|^*,  e)p{0)d0,  and  p{Z)  =  p{Z^^')  =  Z„)  = 
km=ifz{z.  i\'K)p{'K)d'K .  In  (6)  and  in  some  of  what  follows,  we  suppress  x  and  n  in  the 
context  of  a  fixed  realization  of  the  data.  For  the  purposes  of  this  article,  we  will  assume 
that  integration  with  respect  to  (0,  vr)  can  be  done  analytically.  Desai  (2000)  and  Desai  and 
Emond  (2001)  treat  cases  where  numerical  methods  are  needed  for  integration  with  respect 
to  (0,  tt). 


2.2  Importance  Sampling 


The  integral  with  respect  to  Z  in  (6)  is  the  summation  over  G”  points,  which  can  be  done 
exactly  for  very  small  data  sets.  Otherwise,  importance  sampling  is  a  possible  approach. 
The  importance  sampling  estimate  is  given  by 


(7) 


k=l 


where  the  Z^s  are  sampled  (simulated)  independently  from  the  density  h{-).  Importance 
sampling  has  classical  elegance:  I  converges  almost  surely  to  I  and  VK{i  —  7)/\/var(/) 
has  a  standard  normal  limiting  distribution  when  L(xjZ)p(Z)/h(Z)  has  finite  mean  and 
variance  under  h{-).  The  latter  result  can  provide  a  ready  means  for  assessing  the  precision 
of  /  through  its  standard  error,  provided  that  the  standard  error  is  an  accurate  estimate  of 
\/var(/). 

The  catch  to  using  (7)  is  the  choice  of  h{-),  since  this  choice  determines  how  easily  the 
sampling  can  be  done  in  practice  and  also  determines  the  precision  of  I.  Easy  choices  for  h{-) 
often  lead  to  inefficient  estimates.  For  example,  it  may  be  possible  to  simulate  the  Z’s  easily 
when  the  importance  sampling  function  is  the  prior  distribution  of  Z,  i.e.  when  h{Z)  =  p{Z). 
When  the  data  are  reasonably  informative,  this  results  in  sampling  mostly  Z/^'s  for  which 
L(x|Z;t)  has  no  substantial  mass,  leading  to  a  very  inefficient  estimator  (Raftery,  1996b). 
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Worse  yet,  the  empirical  variance  of  the  I^s  can  markedly  underestimate  the  true  variance 
of  /  under  these  circumstances,  leading  to  false  confidence  in  the  estimate.  Stephens  and 
Donnelly  (2000)  provide  an  extreme  example  of  this  phenomenon  in  practice. 

The  problem,  then,  is  to  find  a  good  choice  of  h{Z):  one  that  is  reasonably  easy  to  sample 
from,  and  that  also  provides  good  efficiency.  The  choice  of  h(Z)  that  minimizes  vai(I)  is 
p(Zjx)  =  =  ZjX^”)  =  x)  (c.f.  Zhang,  1996;  Appendix  A).  This  minimum  variance 

is  zero,  but  one  needs  to  know  I  in  order  to  know  p(Z|x),  so  sampling  directly  from  p(Z|x) 
is  not  possible.  A  potential  surrogate  for  p(Zjx)  is  p(Z|x,  r  =  f)  where  f  is  the  maximum 
likelihood  estimate  (MLE)  for  r  in  (2).  We  expect  this  to  be  a  reasonable  starting  point  for 
constructing  h(Z),  since  p(Z|X(”))/p(Z|X(”), r  =  f)  converges  in  probability  to  a  constant 
for  a  sequence  of  Z’s  near  the  mode  for  under  regularity  conditions  (see  Theorem  1). 
Sampling  from  p(Zjx, r  =  f)  has  also  been  suggested  by  Wei  and  Tanner  (1990)  in  the 
different  context  of  sampling  from  a  posterior  distribution;  they  called  this  method  Poor 
Man’s  Data  Augmentation,  or  PMDA. 

In  the  present  context,  sampling  directly  from  p(Z|x,  r  =  f)  has  at  least  three  potential 
pitfalls.  The  first  is  that,  unless  the  likelihood  is  extremely  peaked,  it  tends  to  miss  Z’s 
corresponding  to  important  values  of  L(xjZ),  as  the  sampling  can  stick  to  small  subregions 
of  the  sample  space  for  seemingly  reasonable  values  of  K.  In  this  situation,  the  sample 
variance  of  the  tends  to  underestimate  the  true  variance  of  I.  The  second  pitfall  is  that 
p(Z|x, r  =  f )  is  a  good  approximation  to  p(Z|x)  only  near  the  modal  value  of  Z,  and  the 
third  is  that  the  approximation,  so  far,  is  justified  only  when  the  fitted  model  is  correct  with 
no  superfluous  components  included  (Theorem  1).  In  this  paper  we  propose  two  methods  for 
constructing  importance  sampling  distributions  that  use  f  and/or  p(Zjx,  r  =  f)  to  create 
efhcient  importance  sampling  distributions  while  attempting  to  circumvent  these  pitfalls. 

2.3  Defensive  Mixture  Importance  Sampling 

Here  we  propose  the  use  of  importance  sampling  distributions  that  are  themselves  mixtures, 
with  one  component  equal  to  the  prior  distribution  so  as  to  ensure  coverage  of  the  area  in 
which  the  posterior  density  is  nonnegligible.  We  start  with  distributions  of  the  form 

MZ)  =  (i-%(z)  +  5p(z),  5e[o,i]  (8) 

for  calculation  of  (7),  where  g{Z)  is  an  intelligently-chosen  probability  mass  function  for  Z, 
and  p(Z)  is,  as  before,  the  marginal  prior  distribution  of  Z.  The  idea  of  using  importance 
sampling  distributions  that  are  themselves  mixtures  was  discussed  by  Geyer  (1991),  Oh 
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and  Berger  (1993),  West  (1993),  Givens  and  Raftery  (1996)  and  Raghavan  and  Cox  (1998) 
in  different  contexts.  The  idea  of  using  mixtures  of  the  form  (8)  as  importance  sampling 
distributions  for  evaluating  integrated  likelihoods  was  mentioned  by  Newton  and  Raftery 
(1994)  and  discussed  by  Raftery  (1996b).  However,  the  general  approach  of  using  mixtures 
of  this  form  as  importance  sampling  functions  was  discussed  in  greater  depth  by  Hesterberg 
(1995).  He  called  this  a  “defensive  mixture”  for  the  importance  sampling  distribution,  and 
so  we  will  refer  to  this  approach  as  defensive  mixture  importance  sampling  (DMIS). 

If  we  define  the  quantity  p{7jk)/h{7ik)  to  be  the  “weight”  for  the  term  in  (7),  then  the 
defensive  mixture  estimate  has  the  property  that  the  weight  is  bounded  by  1/5,  guaranteeing 
the  applicability  of  the  Gaussian  central  limit  theorem  for  6  G  [0, 1]  whenever  it  holds  for 
5  =  1,  i.e.  for  sampling  solely  from  the  prior.  In  addition,  the  asymptotic  variance  of  the 
defensive  mixture  estimate  can  be  no  more  than  1/5  times  the  variance  obtained  by  taking 
h{Z)  =  p(Z),  resulting  in  a  bound  on  how  much  worse  DMIS  can  do  relative  to  taking 
h(Z)=p(Z). 

In  order  to  improve  on  p(Z),  we  need  to  choose  g(-)  so  that  h(-)  mimics  p(Z|x).  Our 
proposal  is  to  take  g(Z)  =  p(Z|x, r  =  f)  with  f  being  the  MLE  for  r.  Note  that  this  is 
just  a  multinomial  distribution  with  mean  (zi . .  .z„)'  where  the  z,  =  £'(.Z)|x),r  =  #i)’s  are 
just  the  estimated  values  of  the  missing  z/s  resulting  from  the  E-step  of  the  EM  algorithm 
used  in  obtaining  f  from  (2).  This  proposal  is  supported  by  the  following  theorem,  which 
indicates  that  p(Zjx,  r  =  f)  is  near  p(Z|x)  when  Z  is  near  the  mode  of  L(x|Z)p(Z)  and  the 
fitted  model  is  correct.  The  superscript  (n)’s  are  included  in  the  statement  of  the  theorem 
to  make  the  dependence  on  n  explicit. 


Theorem  1  Let  he  a  random  sample  from  the  distribution  /x(x|t)  where  /x(x|r)  has 
the  representation  based  on  an  unobserved  variable  Z: 


/x(x|r)  =  J^fx\z{^\z,T)fz{z\T)dz. 


Let  be  the  unobserved  sample  from  /z(z|t).  Let  p„(Z„|X(”))  be  the  probability  that 
=  Z„  given  and  /et  p„(Z|X(”),  r)  be  the  probability  that  =  Z  given  (X(”),r). 
That  is, 


;^n(zjxW) 


I  Ui  fx\z{^i\Zi,r)fz{Zi\T)p{T)dT 

/(XW) 


and 


,,,(zix<”).r)=nn 

i  3 


[E,vr,/,(X/0,)J 


nnM%iXi,r). 

i  3 
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Define  ZijfiKi.r)  =  r]  =  p{zij  =  l|Xj,r),  and  let  Tn  be  the  MLE  for  r  in  (6)  based 

on  X^”).  Hence,  the  Zij(Ki,Tn)’s  are  the  expected  values  of  the  missing  data  used  in  the 
EM  algorithm  (Dempster,  Laird  and  Rubin,  1911).  Let  z(X^^^)  denote  the  “sample”  of  Zij’s 
formed  in  this  manner.  Note  that  Zij  is  not  restricted  to  {0,1}.  Define 


J{t)  =  -E 


log/x,z(Xj,i(Xj,ro)|r) 


and  , 


J(t)  =  -E 


il 


log/x(Xj|r) 


where  tq  represents  the  true  value  of  r  and  expectations  are  taken  under  the  true  value.  We 
assume  that  J{tq)  and  J{tq)  exist  and  are  positive  definite.  Assume  that  conditions  (1)  - 
(5)  in  Appendix  A  hold.  Then 


;5„(i(XH)|XW) 


A 


;5„(i(XH)|XH,r  =  f„)  '  |J(ro)|^/2' 

The  proof  is  given  in  Appendix  A. 


(9) 


Remarks:  1.  The  amount  of  unknown  information  in  this  problem  grows  in  proportion  to 
n.  The  theorem  provides  a  measure  of  closeness  between  p(Z„|x,  r  =  r)  and  p(Z„|x)  only  for 
a  single  sequence  of  Z„’s.  The  approximation  of  p(Z„|x)  by  p(Z„|x,  r  =  r)  may  not  be  good 
at  points  far  from  This  is  to  be  expected,  since  the  data  do  not  provide  information 

about  the  joint  distribution  of  X  and  Z. 


2.  Condition  2  requires  that  f„  converges  to  a  single  point.  Hence,  technically,  it  may  be 
necessary  to  restrict  the  integration  to  a  subset  of  the  parameter  space  so  that  L(xjZ)p(Z) 
is  unimodal  over  this  subset.  Allowing  f„  to  converge  over  a  set  of  G  points  still  leads  to  a 
finite  limit,  however  (see  Feng  and  McCulloch,  1996). 

3.  Condition  (3)  in  Appendix  A  and  the  assumption  of  positive  definiteness  for  J{tq)  and 
J(ro)  both  require  that  tt^  >  0,  j  =  1, ...  G.  It  is  of  interest  to  know  whether  these  condi¬ 
tions  may  be  relaxed,  since  we  wish  to  have  a  good  estimate  ofp(Zjx)  even  when  we  have  fit 
a  model  with  the  wrong  number  of  non-zero  components.  In  the  situation  where  too  many 
components  are  fitted,  Feng  and  McCulloch  (1996)  have  shown  that  consistency  of  still 
holds,  so  that  p(.2(X(”))|X(”)  ,  r  =  Tn)  will  have  the  same  limiting  value  under  the  over-fitted 
model  as  under  the  correct  model.  However,  the  remainder  term  in  expansion  (26)  does  not 
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converge  to  zero  in  this  situation,  and  |  j(To)|  and  |  J(to)|  are  both  zero.  In  this  situation, 
we  conjecture  that  the  ratio  on  the  LHS  of  (9)  is  where  d  is  the  number  of  zero 

components  in  the  over-fitted  model.  □ 


In  order  to  apply  the  proposed  DMIS  method,  one  needs  to  choose  6.  Hesterberg  (1995) 
found  that  his  simulation  results  were  not  sensitive  to  the  choice  of  6  within  a  range  from 
0.1  to  0.5.  We  found  that  5’s  in  this  range  gave  reasonably  unbiased  results  in  very  simple 
problems,  but  that  the  sample  variance  of  the  resulting  could  vary  by  a  factor  of  nearly 
16  as  S  varied,  even  in  these  simple  problems.  Also,  the  accuracy  of  the  standard  errors 
was  sensitive  to  6.  Hence,  a  way  of  choosing  6  is  needed.  The  method  we  have  developed 
chooses  6  so  as  to  minimize  the  variance  of  /  under  a  simplifying  approximation,  given  by 
the  following  theorem. 


Theorem  2  Suppose  that  L(x|Z)  takes  on  only  two  distinct  values  as  Z  varies.  Define 
Zm  =  argmax2{L{x\Z)p{Z)} .  Letdgpt  be  the  value  of  6  that  minimizes  the  variance  off  in  (7) 
over  all  h{Z)  of  the  form  in  (8),  and  let  V2{S)  =  var{var{I))  with  var{I)  =  K~^  J2k{^k  ~  , 

the  sample  variance  of  Ik ’s.  Then 


Jp(ZM|x,r  =  f)  -  L(x|Zm)p(Zm) 

/p(Zm|x,  r  =  f )  -  Ip{Zm) 


Moreover,  6opt  is  the  minimizer  of  V2{S)  over  h{Z)  of  the  form  in  (8). 


(10) 


The  proof  is  given  in  Appendix  B.  The  proof  shows  that,  regardless  of  the  form  of  h{-),  the 
most  efhcient  h{-)  in  this  simplified  situation  satisfies 

KZm)  =  pCZ^mI^)-  (11) 


That  is,  we  choose  our  importance  sampling  distribution  to  match  the  optimal  importance 
sampling  distribution  at  its  modal  value,  and  (11)  provides  a  potential  criterion  for  choosing 
h{-)  regardless  of  whether  h{-)  has  the  mixture  form.  For  peaked  likelihoods,  /p(Zjvr)  is  small 
relative  to  /p(Zm|x,  t  =  t)  and  can  be  dropped  from  the  expression. 

Equation  (10)  contains  two  unknown  quantities,  Zm  and  /,  which  must  both  be  esti¬ 
mated.  For  an  initial  estimate  of  Zm,  we  set  the  ifi^  component  of  Zm  to  be 

%  =  l[j  =  maxprob(2;j;t  =  l|xj,r  =  f)],  (12) 

k 

where  ![•]  represents  the  indicator  function.  That  is,  we  assign  the  observation  to  the 
component  with  maximum  posterior  probability  conditional  on  X  =  x  and  r  =  r.  Zm’s 
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obtained  in  this  manner  from  r’s  which  are  other  local  maxima  of  the  likelihood  equation 
should  be  examined  as  potentially  better  estimates  of  Zm-  Also,  Z’s  close  to  Zm  can  be 
examined  to  see  whether  any  has  a  higher  value  of  L(xjZ)p(Z),  providing  a  better  estimate 
of  Zm- 

The  value  /  in  (10)  is  replaced  by  /  estimated  using  6  =  0.5.  This  produces  a  two-step 
adaptive  procedure  for  estimating  I.  Even  though  the  justification  may  initially  appear  to 
be  overly  simplistic,  we  found  that  choosing  6  according  to  (10)  worked  well  in  simulations, 
resulting  in  var(/)  at  or  very  near  the  empirically  determined  lower  bound,  even  when  the 
assumption  of  only  two  values  for  L(x\Z)p(Z)  was  far  from  the  truth.  In  principle,  one  could 
take  more  than  two  steps  in  this  procedure  and  iterate  to  convergence  of  /,  but  in  practice 
we  found  little  advantage  to  going  beyond  two  steps  in  our  examples. 

The  defensive  mixture  method  need  not  be  limited  to  two  components  for  h{-).  Because 
p(Z|x)  =  / p(Zjx, r)p(x|r)p(r)dr//  can  be  approximated  arbitrarily  closely  by  a  weighted 
sum  of  terms  of  the  form  p(Zjx,  r^),  one  might  take 

^(Z)  =  l]5tp(Zix,r  =  Tt),  (13) 

with  T  >  1  and  =  1  —  5.  For  example,  in  a  data  set  where  there  is  significant 

multimodality  (beyond  that  due  to  label-switching),  the  f’s  in  (13)  could  correspond  to  the 
modes  of  the  likelihood  surface.  The  6  and  the  5t’s  would  be  fixed  at  1/(T  -|-  1)  to  obtain 
Iq,  an  initial  estimate  of  /,  and  then  chosen  adaptively  in  a  second  step  by  solving  the  set 
of  equations 

hC^t)  =  ^(Ztjx),  t  =  1, . . . ,  T.  (14) 

Here,  the  Z^’s  are  determined  by  the  fj’s  via  (12),  and  p(Zt|x)  =  L(x|Z)p(Z)//o.  This  vari¬ 
ant  of  the  DMIS  method  is  examined  in  our  simulations  with  Data  Set  5  in  Section  4. 


2.4  The  Z-Distance  Method 

Our  second  method  is  a  novel  sampling  method  that  directly  targets  for  sampling  Z  near 
Zm-  By  “Z  near  Zm”  we  mean  near  in  the  sense  that  |L(xjZ)p(Z)  —  L(xjZM)p(ZM)|  is 
relatively  small,  making  Z  an  important  point  to  sample.  For  motivation,  consider  the  case 
where  max^t 4'('2iA;  =  l|x, r  =  f)  is  not  much  greater  than  the  p{zik  =  l|x, r  =  f)  for  other 
A:’s.  Then,  the  point  near  Zm  with  the  observation  re-assigned  to  another  group  will  have 
a  value  of  L(xjZ)p(Z)  close  to  L(xjZM)p(ZM)-  Our  second  method,  “Z-distance”  sampling. 
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attempts  to  target  these  points  with  such  re-assignments.  Z-distance  sampling  is  done  as 
follows: 

1.  Divide  the  n  observations  into  an  initial  grouping  consisting  of  R  groups  with  R  G 
{1, . . .  ,n}.  Each  of  the  R  groups  comprises  observations  that  are  close  to  each  other 
in  the  sense  that  they  have  high  conditional  probability  of  being  in  the  same  group. 
For  example,  —  Zi^\\  is  small  for  observations  R  and  ^2  in  the  same  group,  where 
Zi^  is  defined  in  Theorem  1.  A  non-parsimonious  clustering  algorithm  can  be  used  to 
form  the  initial  groups,  or  the  initial  groups  can  be  those  given  by  Zm  in  (12).  Let  Vj 
be  the  number  of  observations  in  the  group. 

2.  A  Zf;  in  the  importance  sampling  step  is  formed  by  taking  the  Vj  observations  in  the 
group  and  redistributing  them  into  the  G  groups  according  to  a  Dirichlet-Multinomial 
distribution  with  mean  parameter  [ij  =  {/iji, . . . , /ijo)  and  dispersion  parameter 
that  may  depend  on  p(Z|x,  r  =  f). 

In  Appendix  C,  we  review  the  Dirichlet-Multinomial  distribution  with  the  above  parametriza- 
tion  and  provide  details  on  how  to  carry  out  the  sampling  in  practice.  The  choice  of  the 
Dirichlet-Multinomial  distribution  in  step  2  allows  for  correlation  between  the  elements  of 
Zj,  unlike  sampling  from  p(Zjx,  f).  This  method  also  contains  some  previously  considered 
choices  of  h{-)  as  special  cases.  If  each  observation  is  its  own  group  and  [lij  =  p{zij  =  l|f,  x), 
then  we  have  sampling  from  the  conditional  posterior  for  Z,  p(Z|x,  r  =  f).  If  the  prior  on  tt 
is  a  Dirichlet  distribution  (its  natural  conjugate  prior),  then  p{Z)  has  a  Dirichlet-multinomial 
distribution.  Hence,  the  above  algorithm  results  in  sampling  from  p{Z)  when  R  =  1  and 
p{Z)  is  used  in  step  2. 

Another  version  we  have  examined  that  allows  for  adaptive  estimation  of  the  importance 
sampling  distribution  is  to  set  pj  equal  to  the  mean  of  the  p(Z|xj,r)  over  Xj’s  in  the 
group,  and  take  =  cvj/{pj{l  —  pj)  —  Vj)  where  Vj  is  the  variance  of  the  p(Z|xj,  f)’s  in  the 
jth  group,  Pj  is  their  mean  and  c  is  an  estimated  tuning  parameter.  The  ad  hoc  justification 
for  this  is  that  w  =  n/(7r(l  —  tt)  —  n)  for  random  p’s  from  a  Beta(a,  ft)  distribution  where 
TJ-  =  a/{a  +  b)  is  its  mean,  v  is  its  variance  and  w  =  (a  -|-  b)~^.  We  chose  c  adaptively  using 
(11).  For  reference,  we  call  this  sampling  version  Z-distance  2,  or  ZD-2,  sampling.  See 
Appendix  C  for  another  suggested  variant  of  Z-Distance  sampling. 

The  choice  of  the  Dirichlet-multinomial  distribution  in  step  2  is  one  of  practicality  and 
could  be  modified.  This  formulation  of  Z-distance  sampling  is  a  very  general  one  that 
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allows  for  a  wide  range  of  sampling  schemes  that  cannot  all  be  evaluated  in  a  single  study. 
Nevertheless,  we  put  it  forth  as  a  general  proposal  given  its  intuitive  appeal  and  unifying 
nature. 

A  simple  version  of  the  Z-distance  method  that  we  have  evaluated  in  several  simulations 
is  the  “Uniform  Distance”  method,  or  UD  method,  defined  as  follows.  Set  R  =  G,  use  the 
initial  group  assignments  given  by  (12),  and  take  [ij  =  (1/G, . . . ,  l/G)  V  j  and  =  1/G  V  j. 
The  UD  method  has  the  following  properties  : 

(1)  It  is  adaptive  in  the  sense  that  it  uses  information  from  f  to  form  the  initial  grouping,  but 
it  does  not  require  estimation  of  additional  importance  sampling  distribution  parameters. 
A  modified  form  of  the  UD  method,  UD^o,^,  allows  R  >  G  initial  groups  to  be  chosen  by 
applying  a  non-parsimonious  clustering  algorithm.  This  appears  to  be  useful  when  the  data 
give  only  vague  information  about  the  true  groupings. 

(2)  While  ^(Zm)  ^  ^'(Zm|x)  as  in  the  methods  above,  K  can  be  chosen  to  control  the 
probability  of  including  in  the  sample  Q  times  for  chosen  Q  (Q  =  5,  for  example). 

(3)  The  sampling  is  symmetric  with  respect  to  label-switching.  That  is,  if  Zi  and  Z2  are 
identical  except  for  different  labeling  of  the  groups,  then  h(Zi)  =  h(Z2)  under  the  UD  method 
of  sampling.  Property  (3)  provides  an  easy  way  of  dealing  with  multimodality,  discussed  in 
more  detail  below. 

3  Label- Switching  and  Multimodality 

The  “label-switching  problem”  refers  to  the  fact  that  when  the  mixture  components  all  have 
the  same  parametric  form,  the  likelihood  has  the  same  value  for  different  labelings  of  the 
groups: 

/  (Xi  I  Zi  =  (Zii, . . . ,  Zio),  6>  =  (6>1, . . . ,  6>g),  vr  =  (tti,  . . . ,  ttg)) 

f  I  {^ipi  5  •  •  •  5  ^  i^Pi  5  •  •  •  5  ^Pg}^  ^  i^Pi  5  •  •  •  5  ^Pg))  ’ 

where  (pi, . . .  ,po)  is  any  permutation  of  (1,...,G).  In  general,  there  are  up  to  G!  dis¬ 
tinct  labelings  that  result  in  the  same  value  of  /(xjzjff),  and  we  say  that  these  label¬ 
ings  are  equivalent.  There  will  be  fewer  than  G!  distinct  labelings  when  not  all  the  pa¬ 
rameter  values  are  distinct.  Note,  in  particular,  that  this  is  the  case  when  two  or  more 
of  the  TTj’s  are  zero.  If  p(0,7r)  is  symmetric  with  respect  to  label-switching,  that  is  if 
p((0i,...,0G),(vri,...,7rG))  =  p((0pi,...,^PG),(vrpi,...,7rpQ)),  then  L(x|Z)p(Z)  will  have 
G!  modes  with  equal  mass.  Because  of  this  fact,  we  use  a  symmetric  prior.  This  results  in  no 
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loss  of  generality,  because  any  non-symmetric  prior  can  be  reformulated  into  an  equivalent 
symmetric  prior  with  the  same  total  mass  on  each  set  of  equivalent  points,  so  that  I  remains 
unchanged. 

The  defensive  mixture  method  as  described  above  produces  a  unimodal  sampling  dis¬ 
tribution  that  can  be  a  poor  approximation  to  p(Z|x)  under  multimodality.  The  obvious 
modification  is 

h{Z)  =  {1-5)-^  ^p(Zix,  T  =  Tb)  +  6p{Z),  (16) 

G.  j 

where  b  indexes  the  Gl  permutations  of  the  components  of  f .  The  sampling  is  best  done 
in  a  stratified  manner,  since  this  is  both  simpler  and  more  efficient  (Oh  and  Berger,  1993; 
Hesterberg,  1995):  if  K  samples  are  planned,  then  Z  is  drawn  from  p{Z)  for  K5  of  the 
samples,  and  from  each  of  the  p(Z|r  =  f^,  x)’s  for  K{1  —  5)/G\  of  the  samples.  We  may  still 
calculate  bgpt  using  (11). 

As  noted  above,  the  UD  method  is  already  symmetric  with  respect  to  label-switching, 
so  no  modification  is  needed. 

As  an  alternative  to  sampling  at  each  mode,  we  have  used  the  following  method,  which 
is  applicable  when  the  symmetric  prior  is  employed.  Assign  each  Z^  to  an  equivalence  class 
containing  all  the  Z’s  corresponding  to  an  equivalent  labeling  in  (15).  Hence,  the  equivalence 
class  to  which  Z^  belongs  consists  of  Z^  plus  all  distinct  permutations  of  the  G  columns  of 
Zk-  Denote  the  resulting  equivalence  classes  by  j  =  1, . . . ,  where  Ag,  the  number 
of  such  equivalence  classes,  need  not  be  known.  Sampling  from  h{z)  can  then  be  viewed  as 
sampling  from  the  Aj,  and  it  follows  that 

Ne 

^  =  E  E  i(x!Zi)p(Zi)  (17) 

j—1  Zk^Ej 

Ne 

=  E#(£'j)r(x|Zi)p(Zj).  (18) 

1=1 


where  Zj  is  any  Z  G  Ej  and  #{Ej)  is  the  number  of  Z’s  in  Ej.  It  is  easy  to  see  that 
#{Ej)  =  Gl/EOjl,  where  EOj  is  the  number  of  groups  with  no  members  in  it  (i.e  the  number 
of  columns  consisting  entirely  of  zeros)  for  Z’s  in  the  equivalence  class  Ej.  Let  j{Z)  denote 
the  subscript  of  the  equivalence  class  to  which  Z  belongs.  The  importance  sampling  estimate 
is 


K 

E 

k=l 


G\ 


AO 


'l(Zfc)' 


p{^k) 


(19) 
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In  practice,  the  equivalence  classes  need  not  be  enumerated.  At  each  iteration,  we  sample 
Zf;  from  h{-)  and  then  calculate  E0j(^Zk)  iteration.  The  last 

step  can  take  some  computing  effort.  However,  if  h{-)  is  highly  concentrated  at  one  of  the 
symmetry-based  modes,  approximated  by  h{Zk).  This  will  be  the 

case  when  the  likelihood  surface  is  peaked  and  5  is  not  close  to  1. 

Another  method  for  handling  the  label-switching  problem  under  a  symmetric  prior  is  the 
following.  Instead  of  estimating  /,  estimate  /  where 

‘=  f  i?i^Mxlz)p{Z)dZ  =  f  _^L(x!Z)p(z)i[z  e  S]dz,  (20) 

where  S  represents  any  one  of  Gl  equivalent  “regions”  of  integration.  “Regions”  is  put  in 
quotes  because  some  Z/^'s  will  have  fractional  membership  in  more  than  one  region.  For 
example,  when  G  =  2  and  n  is  odd,  5  =  {Z  :  Zi2  <  {n  —  l)/2}.  When  n  is  even, 

S  =  {Zk  :  YTj=i^  Zi2  <  nj2  —  1},  while  the  points  Z  with  YTj=\  Zk2  =  has  a  membership 
weight  of  .5  in  S.  Hence,  we  can  sample  near  one  mode  using  hs  as  constucted  in  Section 
2,  and  any  Z/^  that  falls  outside  of  S  for  that  mode  contributes  =  0  to  (7).  When  n 
is  even  and  Zk2  =  n/2,  is  weighted  by  .5  in  (7).  When  sampling  is  done  mostly 
near  the  mode,  few  Z^s  fall  outside  S  and  little  computing  efficiency  is  lost  in  exchange  for 
substantial  simplicity. 

With  this  particular  choice  of  5,  the  proposed  method  seems  likely  to  work  best  when 
the  estimated  proportions  in  the  different  groups  are  not  too  similar.  The  problem  is  similar 
to  the  label-switching  problem  in  MCMC  estimation  of  mixture  models,  and  methods  for 
dealing  with  this  could  also  be  adapted  to  the  present  setting.  The  most  promising  such 
methods  essentially  amount  to  different  choices  of  S  and  algorithms  for  finding  it,  and  the 
approaches  of  Celeux,  Hum  and  Robert  (2000)  and  Stephens  (2000)  might  be  useful  here 
also. 


4  Simulations  and  Applications 

Simulation  studies  were  carried  out  to  assess  the  performance  of  the  DMIS  and  UD  methods. 
Variations  of  the  Z-distance  method  were  applied  in  selected  cases.  Our  simulations  are 
divided  into  two  groups.  The  first  group  consists  of  three  small  scale  problems  (Data  Sets  1, 
2,  and  3)  where  performance  can  be  assessed  relative  to  the  exact  answer.  The  goals  of  this 
group  of  simulations  were  to  assess  the  absolute  and  relative  performance  of  these  methods, 
to  assess  the  sensitivity  of  the  DMIS  method  to  choices  of  5,  to  assess  the  sensitivity  of  the 
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Table  1:  Three  Modes  for  the  Likelihood  Surface  for  Data  Set  3. 


-log  lik 

h 

h 

TT 

L(x,  Zm)  X  10»° 

175.74 

.169 

.276 

.155 

2.48 

175.78 

.220 

.151 

.5 

0.0013 

175.91 

.185 

.532 

0 

11.3 

UD  method  to  the  initial  grouping,  and  to  determine  how  the  performance  of  each  method 
varied  depending  on  whether  the  likelihood  was  sharply  peaked  or  flat. 

In  the  second  group  of  simulations,  the  methods  are  applied  to  three  data  sets  (Data 
Sets  4,  5,  and  6)  with  larger  sample  sizes  (n=204).  The  goals  of  the  second  group  of 
simulations  were  to  assess  the  practical  applicability  of  each  method  for  larger  samples,  to 
determine  whether  there  were  any  obvious  breakdowns  in  the  methods,  to  obtain  a  rough 
assessment  of  the  accuracy  of  the  methods  for  larger-sample  problems,  and  to  determine 
whether  performance  varied  depending  on  the  extent  to  which  the  likelihood  was  flat  or 
peaked. 

For  all  the  simulation  studies,  the  data  were  in  the  form  of  X,  successes  in  n*  trials  where 
rii  was  known.  The  htted  model  was  a  two-component  binomial  mixture  with  uniform  priors 
on  the  two  mean  parameters  and  on  the  mixing  proportion  parameter.  We  used  K  =  1000  for 
Data  Sets  1,  2,  and  3,  and  K  =  10,  000  for  Data  Sets  4,  5,  and  6.  This  model  was  chosen  for 
the  simulation  study  in  order  to  study  the  performance  of  the  importance  sampling  methods 
for  a  particular  application  in  molecular  biology  (Newton  et  al,  1998;  Desai,  2000).  Data 
Sets  1  and  2  are  each  a  subset  from  one  of  two  allelotype  data  sets  (Barrett  et  al,  1996;  and 
Shibagaki  et  al,  1994,  respectively)  where  it  is  of  interest  to  determine  whether  there  exist 
two  binomial  components  (Newton  et  al,  1998).  Data  Set  3  is  simulated  data  from  a  single 
binomial  component  with  mean  0.22  and  n^’s  that  are  the  same  as  the  n^’s  in  Data  Set  2. 
[Should  we  show  the  data  sets  in  an  appendix?  -  AR,  5/12] 

The  resulting  likelihood  surfaces  for  r  for  the  three  data  sets  range  from  markedly  peaked 
to  flat.  The  likelihood  surface  for  r  for  Data  Set  3  is  relatively  flat  with  three  modes  of 
nearly  equal  height  which  correspond  to  disparate  values  of  r  (Table  1).  In  Table  1,  Zm  is 
determined  by  (12)  using  the  f  indicated  in  the  row.  Note  that  the  MLE  is  given  in  row  1, 
but  the  Zm  derived  from  the  MLE  accounts  for  only  2.5%  of  the  mass  of  /,  while  the  Zm 
in  the  third  row  accounts  for  11%  of  the  mass.  This  shows  that  the  MLE  does  not  always 
translate  to  the  best  estimate  of  Zm- 
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Table  2  shows  results  for  the  defensive  mixture  method  as  a  function  of  5.  We  calculated 
5opt  from  (10).  In  each  data  set,  5opt  gave  the  minimum  or  near  minimum  variance  for  /, 
indicating  that  the  theoretical  arguments  underlying  its  use  translate  to  good  performance 
in  practice.  As  expected  from  its  derivation,  5opt  performed  best  in  Data  Set  1  where  the 
likelihood  is  very  peaked  and  the  assumption  of  only  two  distinct  values  for  T(x|Z)  is  nearly 
satished.  However,  5opt  still  performed  well  in  Data  Set  3  where  the  assumption  is  not  even 
approximately  true.  Mean-squared  errors  as  a  function  of  5  followed  the  same  pattern. 
On  the  other  hand,  the  empirical  standard  deviation  of  the  /^’s  was  a  better  estimate  of 
the  (estimated)  true  standard  deviation  at  5opt  in  Data  Sets  2  &  3  than  in  Data  Set  1.  The 
results  suggest  that  a  5  in  [Sopt-,  0.99]  may  provide  the  best  balance  of  efficiency  and  estimated 
standard  deviations  that  are  accurate  estimates  of  the  true  standard  deviation. 

Comparative  results  for  the  DMIS  and  UD  methods  are  shown  in  Table  3,  along  with 
results  for  the  UD  method  as  a  function  of  the  initial  grouping.  For  comparison,  results 
from  Monte  Carlo  estimation  by  sampling  from  the  prior  (/i(r)  =  p(r))  are  also  shown.  The 
variants  UD-1,  UD-2  and  UD-3  in  Table  3  refer  to  using  the  Zm’s  calculated  from  (12)  using 
the  three  different  f’s  in  rows  1-3,  respectively,  in  Table  1  as  the  initial  groupings  for  the  UD 
method.  For  comparison,  the  table  entry  /i(r)  =  p(r)  refers  to  the  result  obtained  by  sam¬ 
pling  from  p{t)  for  integration  of  (1).  To  assess  the  results,  consider  the  scale  of  evaluation  of 
evidence  of  Bayes  factors  in  Jeffreys  (1961)  and  Kass  and  Raftery  (1995),  according  to  which 
a  Bayes  factor  between  1/3  and  3  constitutes  evidence  “worth  no  more  than  a  bare  mention.” 
This  suggests  that  we  would  want  Bayes  factors  (ratios  of  two  integrated  likelihoods)  to  be 
off  by  no  more  than  a  factor  of  3  with  high  probability.  Some  simple  calculations  suggest 
that  this  will  be  satished  if  the  ratio  of  \/MSE  to  I  is  no  more  than  about  0.2. 

Overall,  the  results  for  the  DMIS  and  UD  methods  were  satisfactory.  The  bias  was 
small,  the  coefficient  of  variation  was  well  within  the  desired  range  for  both  methods  and 
all  data  three  sets,  and  the  estimated  standard  errors  were  close  to  the  empirical  standard 
deviations,  on  average.  In  contrast,  Monte  Carlo  estimation  by  sampling  from  the  prior  for 
the  parameters  r  was  substantially  less  accurate  on  average;  more  samples  would  be  needed 
for  it  to  be  minimally  acceptable. 

In  Data  Set  1  the  DMIS  method,  with  its  ability  to  mimic  the  peaked  posterior,  had  the 
best  efficiency.  The  UD  method  and  sampling  from  the  prior  for  r  were  less  efficient  but 
they  did  yield  reasonably  accurate  standard  errors.  The  DMIS  and  UD  methods  performed 
comparably  in  Data  Set  2  for  the  modestly  peaked  likelihood,  and  the  UD  method  was 
slightly  better  than  the  DMIS  method  for  the  flat  likelihood  in  Data  Set  3. 


16 


Table  2:  Small  Sample  Simulation  Results  (N=17):  Performance  of  defensive  mixture  im¬ 
portance  sampling  as  a  function  of  5 


Data  set  1:  markedly  peaked  likelihood 

I  =  56.38  X  10“^® 

(i 

i 

SEM 

SD(/) 

VMSE 

0 

54.62 

2.71 

4.68 

5.00 

0.01 

55.38 

3.15 

5.01 

5.12 

0.027 

56.04 

3.78 

7.47 

7.48 

0.1 

55.79 

3.48 

6.19 

6.22 

0.208* 

55.21 

3.16 

4.44 

4.60 

0.5 

57.49 

5.13 

6.43 

6.52 

0.9 

56.61 

7.52 

6.84 

6.84 

0.99 

59.57 

18.84 

11.36 

11.79 

1 

66.55 

52.23 

78.89 

79.54 

Data  set  2 

I  = 

!:  peaked  likelihood 

51.29  X  10“*^ 

5 

i 

SEM 

SD(I) 

VMSE 

0 

44.27 

3.78 

5.38 

8.85 

0.01 

54.13 

11.42 

63.76 

63.83 

0.1 

50.27 

5.93 

13.53 

13.57 

0.5 

49.29 

4.62 

5.73 

6.07 

0.586* 

50.51 

5.51 

6.40 

6.44 

0.9 

50.43 

7.59 

8.23 

8.27 

0.99 

53.5 

16.04 

15.07 

15.23 

1 

56.89 

39.41 

86.41 

86.59 

Data  set  3:  flat,  multimodal  likelihood 

I  =  98.87  X  10“*° 

(i 

i 

SEM 

SD(7) 

VMSE 

0 

89.78 

18.3 

47.85 

48.71 

0.01 

95.12 

17.28 

25.58 

25.85 

0.1 

98.7 

11.88 

13.90 

13.90 

0.5 

97.86 

7.20 

7.83 

7.89 

0.9 

98.76 

7.24 

8.50 

8.50 

0.99 

98.25 

8.63 

9.14 

9.16 

1* 

97.62 

9.53 

9.54 

9.62 

NOTE:  Each  row  represents  results  of  100  trials  of  the  procedure  with  K  =  1000  for  each  trial. 

I  is  the  mean  of  I  over  the  100  trials,  SEM  is  the  mean  of  the  standard  errors  of  I  over  the  100 
trials,  and  SD(/)  is  the  empirical  standard  deviation  of  the  100  /’s  (taken  to  be  an  estimate  of  the 
true  standard  deviation  of  I  for  comparisons  with  SEM) . 

*  This  is  6 opt  from  (10). 
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Table  3:  Small  Sample  Simulation  Results  (N=17):  Comparison  of  Methods 


Data  set  1 

:  markedly  peaked  likelihood 

I  =  56.38  X  10“^® 

Method 

i 

SEM 

SD(/) 

VMSE 

DMIS 

55.21 

3.16 

4.44 

4.60 

UD 

57.10 

7.38 

8.50 

8.53 

II 

56.78 

13.12 

13.34 

13.35 

Data  set  2:  peaked  likelihood 

I  =  51.29  X  10 

-85 

Method 

i 

SEM 

SD(/) 

VMSE 

DMIS 

50.51 

5.51 

6.40 

6.44 

UD 

52.16 

6.13 

6.18 

6.24 

II 

52.17 

9.77 

10.44 

10.48 

Data  set  i 

3:  fiat,  multimodal  likelihood 

I  =  98.87  X  10 

-80 

Method 

I 

SEM 

SD(/) 

VMSE 

DMIS 

97.62 

9.53 

9.54 

9.62 

UD-1 

94.98 

6.20 

6.52 

7.67 

UD-2 

98.13 

6.64 

7.12 

7.16 

UD-3 

98.27 

6.63 

7.17 

7.20 

II 

98.66 

15.37 

13.70 

13.70 

NOTE:  Each  row  represents  results  of  100  trials  of  the  procedure  with  K  =  1000  for  each  trial. 
The  columns  are  defined  as  in  Table  2. 

UD-b  refers  to  the  Uniform  Distance  method  taking  the  initial  grouping  to  be  Zm  calculated  from 
(12)  using  f  from  row  h  of  Table  1. 
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The  three  larger  sample  simulation  data  sets  each  consisted  of  204  proportions.  Data  Set 
4  consists  of  12  replicates  of  Data  Set  1,  Data  Set  5  consists  of  12  replicates  of  Data  Set  3, 
and  Data  Set  6  consisted  of  204  replicates  of  the  proportion  8/40.  Data  Sets  4  and  5  were 
constructed  in  this  way  in  order  to  be  able  to  get  bounds  on  I  using  the  exact  values  of  I 
for  Data  Sets  1  and  3,  and  in  order  to  assess  the  impact  of  increasing  the  sample  size  while 
keeping  the  shape  of  the  likelihood  surface  hxed.  To  obtain  bias  estimates  for  Data  Set  5, 
the  true  I  was  taken  to  be  the  value  of  I  resulting  from  importance  sampling  integration  of 
(3)  by  drawing  5  million  samples  directly  from  the  prior  on  the  parameters,  p{9,  tt)  .  This 
appears  to  provide  a  reliable  answer  for  the  flat  likelihood  in  this  example.  Data  Set  6  is 
an  artihcial  example  with  an  easily  calculated  exact  answer  that  should  be  “easy”  for  any 
viable  candidate  method.  Data  Set  6  serves  as  a  gauge  for  assessing  whether  a  method  meets 
a  minimum  performance  standard. 

Results  for  the  larger  sample  sizes  are  shown  in  Table  4.  The  results  show  that  the  2- 
component  DMIS  method  performs  very  well  for  the  markedly  peaked  likelihoods  in  Data 
Sets  4  and  6,  but  performs  poorly  for  the  flat  likelihood  (Data  Set  5)  compared  to  sampling 
from  p{t).  Because  the  likelihood  for  Data  Set  5  has  three  modes  of  comparable  height,  a  4- 
component  DMIS  sampling  method  was  tested  here,  with  components  corresponding  to  each 
mode  and  to  the  prior  as  in  (13).  Taking  each  of  the  four  5’s  (component  sampling  weights) 
in  this  sampling  scheme  to  be  0.25  results  in  a  great  improvement  in  the  bias  as  well  as 
improvement  in  the  variance  of  I  compared  to  the  2-component  DMIS  importance  sampling 
distribution  (Table  4,  hrst  and  second  rows  under  Data  Set  5).  When  we  attempt  to  solve 
(14)  in  order  to  minimize  the  variance  resulting  from  the  4-component  importance  sampling 
distribution,  we  hnd  no  solution  in  [0, 1]®^.  Without  this  constraint,  the  solution  puts  ^3  at 
0.012,  5  <!  0,  ^2  ^  1  and  ^3  >  1.  Hence,  we  set  ^3  —  0.012,  5  —  0  and  split  the  remaining  mass 
between  82  and  ^3.  This  was  very  effective  in  reducing  var(/),  producing  a  54-fold  decrease 
in  var(/)  relative  to  the  equal  weight  case  (Table  4,  third  row  under  Data  Set  5).  However, 
the  estimate  of  I  was  biased  downward,  and  only  50%  of  the  mass  was  “found” .  Still,  the 
MSE  was  signihcantly  improved.  We  also  adjusted  the  5j’s  to  allow  5  =  0.05  (versus  5  =  0 
in  the  previous  simulation).  The  result  was  an  increase  in  the  variance  and  decrease  in  the 
bias,  giving  an  overall  decrease  of  5%  in  the  MSE  (Table  4). 

The  naive  UD  method  with  i?  =  G,  as  described  in  Section  2.2,  does  not  perform  well 
compared  to  the  DMIS  method  for  the  markedly  peaked  likelihood,  nor  does  it  perform  as 
well  as  sampling  from  p(r)  for  the  flat  likelihood  in  Data  Set  5.  (Note  that  the  UD  method 
and  the  DMIS  method  with  5  =  5apt  coincide  in  this  case.)  However,  the  ZD-2  method 
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Table  4:  Larger  Sample  Simulation  Results  (N=204) 


Data  set  4:  Peaked  Likelihood,  I  G  [37.4,486.1 

X  10-«'^ 

Method 

/ 

SEM 

SD(/) 

VMSE 

DMIS,  5  =  0.078 

65.32 

0.31 

0.30 

— 

UD 

66.62 

26.01 

30.91 

— 

ZD-21 

65.32 

0.11 

0.13 

— 

II 

78.82 

37.13 

39.10 

— 

Data  set  5:  Flat,  Multimodal  Likelihood,  I  G  [0.25, 1762J  x  10 

Method 

/ 

SEM 

SD(/) 

Vmse' 

2-component  DMIS  d  =  .5 

12.56 

7.65 

30.44 

31.16 

4-component  DMIS^ 

19.98 

8.13 

24.41 

24.42 

4-component  DMIS^p^ 

9.35 

2.08 

3.22 

10.41 

4-component  DMIS,  6  =  .05^ 

14.59 

3.63 

6.54 

8.03 

2-component  DMIS,  5  =  5opt  =  1;  UD^ 

11.83 

6.02 

14.04 

15.87 

mixture  of  3  UDs® 

15.37 

4.31 

6.59 

7.64 

UDmo(i,  R=3^ 

17.11 

3.36 

4.42 

4.09 

II 

18.85 

2.88 

2.63 

2.63 

Data  set  6:  One  Group;  Peaked  Likelihood,  I  =  20.95  x  10 

-1778 

Method 

/ 

SEM 

SD(/) 

VMSE 

DMIS,  5opt  =  0.7716 

20.99 

0.24 

0.24 

0.24 

UD 

21.06 

0.49 

0.40 

0.41 

II 

21.52 

6.58 

5.89 

5.92 

NOTES:  Each  row  represents  50  trials  with  K  =  10000  for  each  trial. 

*  estimated  based  on  assuming  I  =  19.240  obtained  as  I  from  5  million  samples  from  p(r).  The  standard 
error  of  this  estimate  is  0.001 
^  ZD-2  sampling  as  described  in  Section  2.2. 

^  The  IS  distribution  is  h{Z)  =  6ip{Z\x,T  =  n  +62p{Z\x,t  =  T2  +63p{Z\x,t  =  %  +5p{Z),  as  in  (13),  where 
the  TiS  correspond  to  the  three  modes  of  the  likelihood  surface  (Table  1).  5  =  5i  =  0.25,  i  =  1, 2, 3. 

®  /i  is  the  same  as  above,  except  that  5  and  the  d^’s  were  chosen  adaptively.  The  solution  to  (14)  puts 
da  =  0.012,  d2  >  1,  di  >  1  and  d  <  0,  so  {d,  di,  d2,  da}  =  {0,  .494,  .494, 0.012}  was  used. 

Like  the  case  directly  above,  except  that  d  was  set  to  0.05:  {d,  di,  d2,  da}  =  {.05,  .47,  .47,  .012}. 

®  In  this  case  the  2-component  DMIS  with  d  =  5opt  =  1  is  equivalent  to  the  UD  method,  since  the  latter 
puts  all  observations  into  one  group  initially  {R=  1). 

®  Three  versions  of  the  UD  method  were  mixed  in  equal  proportions.  Each  of  the  UD  initial  groupings  with 
R=2  was  formed  from  one  of  the  2m’s  calculated  from  (12)  using  the  three  f’s  in  (Table  1). 

^  Three  initial  groups  for  the  UD  method  (R=3)  were  created  using  an  ad  hoc  clustering  method  (inspecting 
the  values  of  p(zi|xi,r  =  tmle)  and  dividing  them  into  three  groups  according  to  their  proximity.) 
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and  the  simple  UD^od  method  brought  large  gains  in  efhciency  in  each  case,  respectively. 
Specihcally,  using  three  initial  groups  for  the  UD^od  method  (R=3)  reduced  the  estimated 
MSE  by  a  factor  of  10  for  Data  Set  5,  while  use  of  ZD-2  sampling,  as  described  in  Section 
2.2,  resulted  in  a  variance  reduction  for  /  by  a  factor  greater  than  5  x  10^  (Table  4).  For  the 
contrived  Data  Set  6,  both  the  naive  UD  method  and  the  DMIS  method  performed  well  and 
considerably  outperformed  sampling  from  the  prior.  Sampling  from  the  prior  peformed  best 
for  Data  Set  5,  emphasizing  that  one  should  always  investigate  simple  methods  when  the 
integrand  has  simplifying  features.  Most  applications  will  not  present  such  simple  likelihoods, 
however,  and  adaptive  methods  such  as  those  proposed  here  may  be  needed. 

5  Discussion 

We  have  proposed  a  general  approach  to  approximating  integrated  likelihoods  for  hnite 
mixture  models.  The  basic  idea  is  to  make  the  unobserved  group  memberships  the  argument 
of  the  integral,  and  then  to  use  importance  sampling  in  terms  of  the  group  memberships  to 
evaluate  the  integral.  The  tricky  part  of  this  is  hnding  a  good  importance  sampling  function, 
especially  since  the  dimension  of  the  resulting  integral  is  high  in  most  cases.  We  propose 
using  a  mixture  distribution  a  second  time  in  the  problem,  this  time  as  importance  sampling 
function,  and  we  outline  two  ways  of  doing  this:  defensive  mixture  importance  sampling, 
and  the  Z-distance  method.  We  propose  choosing  the  mixture  parameters  of  the  importance 
sampling  function  adaptively  so  as  to  minimize  the  variance  of  the  approximated  integrand, 
and  we  develop  a  simple  way  of  making  this  choice. 

The  resulting  method  is  easy  to  implement,  essentially  involving  only  simple  multino¬ 
mial  sampling.  It  does  not  become  much  more  complex  as  the  complexity  of  the  mixture 
model  increases,  and  extends  easily  to  more  complicated  mixture  models.  It  is  based  on 
independent  simulation,  and  so  does  not  encounter  problems  of  convergence  due  to  depen¬ 
dence  between  consecutive  samples.  It  also  produces  a  standard  error  and  so  to  some  extent 
is  self-monitoring,  although  care  must  be  taken  when  using  this.  We  also  propose  a  way  of 
dealing  with  the  label-switching  problem  that  can  plague  inference  for  mixture  models.  The 
method  can  also  allow  one  to  make  use  of  analytic  approximations  that  are  valid  when  the 
group  memberships  are  known  (such  as  the  Laplace  method  and  BIC),  even  though  these 
are  not  valid  for  the  mixture  model  itself. 

The  method  seems  to  have  some  advantages  over  the  main  alternative  methods  that 
have  been  proposed.  Analytic  approximations  such  as  Laplace  and  BIC  are  not  valid  for 
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finite  mixture  models.  The  number  of  parameters  in  mixture  models  tends  to  outrun  the 
capabilities  of  numerical  quadrature  rather  quickly.  And  our  methods  seem  easier  to  imple¬ 
ment  than  MCMC,  also  addressing  the  convergence  and  label-switching  problems  that  can 
affect  MCMC  for  mixture  models.  For  further  discussion  of  the  relative  merits  of  importance 
sampling  and  MCMC  in  a  different  context,  see  Stephens  and  Donnelly  (2000). 

The  basic  idea  of  our  method  is  not  limited  to  mixture  models,  and  can  be  applied  more 
generally  to  models  that  can  be  expressed  simply  in  terms  of  latent  or  missing  data,  i.e. 
models  that  lend  themselves  to  the  EM  algorithm  or  to  data  augmentation.  These  include 
simple  hierarchical  models,  variance  component  models,  and  multiple  imputation  models  for 
missing  data. 

Various  other  approaches  have  been  proposed  for  approximating  integrated  likelihoods 
for  models  of  this  kind  using  the  output  of  the  EM  algorithm  or  similar  results.  Cheeseman 
and  Stutz  (1995)  proposed  the  estimator 

/c5  =  A(xjZ)p(Z)//i(Z),  (21) 

where  /i(Z)  =  p(Z|x,  f)|2;^z.  Our  discussion  here  sheds  some  light  on  this  estimator  of  the  in¬ 
tegrated  likelihood.  Recall  that  I  =  J  L{x\Z)p{Z)dZ^  and  that  the  importance  sampling  esti¬ 
mator  of  /  based  on  the  importance  sampling  function  h{-)  is  /  =  K~^  J2k=i  L{'x.\Zk)p{Zk)/h{Zk). 
Now  the  integrated  likelihood  can  be  rewritten  as 

1  =  j  v{Z)h{Z)dZ,  (22) 

where  v{Z)  =  I/(x|Z)p(Z)//i(Z).  The  importance  sampling  estimator  is  then  seen  to  be  the 
simple  Monte-Carlo  integration  estimator 

K 

/  =  iF-i^n(Z,).  (23) 

k=l 

The  Cheeseman-Stutz  estimator  is  then  just 

ics  =  v{Z).  (24) 

From  (22),  (23)  and  (24),  it  is  apparent  that  instead  of  the  simulation-consistent  approach 
of  integrating  over  values  of  Z  simulated  from  /i(-),  the  Cheeseman-Stutz  estimator  proceeds 
by  replacing  Z  by  Z  in  the  integrand.  This  is  not  valid  in  the  present  context,  and  the 
Cheeseman-Stutz  estimator  will  be  biased  in  general,  even  with  an  inhnite  amount  of  simu¬ 
lation.  Note  that  the  published  derivation  of  the  Cheeseman-Stutz  estimator  was  based  on 
a  Laplace  approximation  that  is  not  valid  in  general  for  mixture  models. 
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Chickering  and  Heckerman  (1996)  proposed  a  modification  to  the  Cheeseman-Stutz  esti¬ 
mator  to  take  account  of  the  possibility  that  the  dimension  of  the  parameter  space  for  the 
completed  data  set  (x,  Z)  may  be  different  from  that  for  the  imcomplete  data.  The  need  for 
such  an  adjustment  arises  only  because  of  the  Laplace  method  used  to  derive  the  Cheeseman- 
Stutz  estimator  which,  as  we  have  seen,  is  in  any  event  invalid  for  general  mixture  models. 
No  such  adjustment  is  needed  in  our  importance  sampling  approach. 

Biernacki,  Celeux  and  Govaert  (2000)  have  proposed  the  Integrated  Completed  Likeli¬ 
hood  (ICL),  which  is  similar  to  the  Cheeseman-Stutz  approach  in  that  it  is  based  on  a  single 
value  of  Z,  but  replaces  the  elements  of  Z  by  their  most  likely  values  given  the  data  and 
f  (rather  than  by  their  expected  values  as  in  Cheeseman-Stutz),  and  then  integrates  the 
resulting  completed  likelihood  over  r.  Thus,  Biernacki  et  al.  (2000)  report 

p(xjZM)  =  J  p(x|ZM,r)p(r|ZM)dr, 

where  Zm  is  given  by  (12).  This  is  clearly  not  valid  as  an  approximation  to  the  integrated 
likelihood,  but  Biernacki  et  al.  (2000)  do  not  claim  that  it  is,  and  argue  for  it  instead  in  its 
own  right  as  the  solution  to  a  scientihc  problem  that  is  of  interest  in  some  contexts. 

Wei  and  Tanner  (1990)  proposed  importance  sampling  using  h{Z)  =  p(Zjx, r  =  f), 
a  method  they  called  Poor  Man’s  Data  Augmentation-2  (PMDA-2).  They  used  this  for 
parameter  estimation,  and  not,  as  here,  for  integrated  likelihoods.  However,  it  could  be  used 
for  integrated  likelihoods,  and  then  it  would  be  a  special  case  of  DMIS,  with  5  =  0.  As  an 
importance  sampling  function,  this  may  miss  important  regions  away  from  the  MLE,  and 
the  resulting  estimator  of  the  integrated  likelihood  may  have  high  variance  (or  even,  if  Z  is 
continuous,  inhnite  variance).  In  Wei  and  Tanner’s  (1990)  proposed  PMDA-1  method,  the 
importance  sampling  weights  are  all  set  to  be  equal.  This  greatly  reduces  the  variance,  but 
the  resulting  estimator  is  biased,  even  asymptotically  (i.e.  with  an  amount  of  simulation 
that  tends  to  inhnity). 

The  methods  we  propose  here  are  in  the  spirit  of  the  nonparametric  importance  sampling 
method  of  Zhang  (1996),  in  the  sense  that  we  obtain  an  adaptive  estimate  of  the  most  efhcient 
importance  sampling  distribution.  Adaptive  procedures  for  estimating  an  intractable  p{Z) 
are  described  by  Evans  and  Swartz  (1995),  Oh  and  Berger  (1993)  and  Givens  and  Raftery 
(1996).  Our  goal  here  is  to  go  beyond  estimating  p(Z)  and  estimate  an  importance  sampling 
distribution  which  is  closer  to  the  ideal  distribution  p(Z|x)  than  to  p(Z).  In  addition,  we 
have  addressed  the  issue  of  accurate  estimation  of  the  standard  error  of  /. 

Raghavan  and  Cox  (1998)  proposed  a  different  adaptive  algorithm  for  estimating  the 
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mixing  parameter  5  in  defensive  mixture  importance  sampling.  Their  method  is  designed 
for  importance  sampling  problems  where  there  is  more  than  one  estimand  of  interest.  They 
employ  a  complex  minimization  and  reweighting  scheme  that  allows  the  researcher  to  try  to 
minimize  the  asymptotic  variances  of  several  importance  sampling  estimators  based  on  the 
same  randomly  sampled  values.  We  can  avoid  using  computational  minimization  techniques 
by  using  the  much  simpler  optimal  5  result  in  Theorem  2,  because  we  are  interested  only  in 
calculating  the  integrated  likelihood. 

We  have  explored  various  possible  versions  of  DMIS  and  UD,  hnding  that  in  situations 
where  the  likelihood  is  not  highly  peaked,  better  performance  can  be  achieved  by  adding 
components  to  the  defensive  mixture,  or  by  using  mixtures  of  UD  methods.  It  seems  plausible 
that  the  method  could  be  improved  by  developing  a  systematic  way  of  determining  when 
additional  components  are  needed  in  the  defensive  mixture,  and  adding  them  automatically. 
One  possible  approach  is  adaptive.  Consider,  for  example,  h{Z)  to  be  a  mixture  of  functions 
of  some  form  such  as  (13)  or,  more  parsimoniously,  location-  and/or  scale-shifted  versions  of 
p(Z|Z).  In  a  hrst  stage,  two-component  DMIS  would  be  used  to  generate  a  hrst  (weighted) 
sample  from  p(Zjx).  Then,  using  the  EM  algorithm,  a  mixture  of  functions  of  Z  of  the 
selected  form  could  be  ht  to  the  resulting  sample,  for  example  using  some  criterion  such 
as  BIC  to  choose  the  number  of  components  and  the  selected  functional  form.  In  this  way 
additional  components  would  be  added  automatically  to  the  defensive  mixture  if  needed  to 
hll  out  areas  that  are  underrepresented  in  the  hrst  iteration.  This  adaptive  process  could  be 
continued  iteratively,  although  it  seems  likely  that  the  improvement  would  be  small  past  a 
few  iterations. 

Owen  and  Zhou  (2000)  discussed  the  use  of  control  variates  to  improve  the  performance  of 
defensive  mixture  sampling  for  integration.  The  control  variate  method  provided  impressive 
gains  in  efficiency  in  the  examples  therein.  However,  our  trials  of  the  control  variates  did 
not  improve  the  performance  of  the  DMIS  method  for  integrated  mixture  likelihoods  when 
the  ftted  G  was  greater  than  the  true  G  (results  not  shown).  More  research  is  needed  to 
determine  how  the  techniques  proposed  by  Owen  and  Zhou  may  complement  those  described 
here. 

A  more  direct  method  for  sampling  Z’s  that  are  moderately  close  to  Zm  is  embodied 
in  the  Z-distance  sampling  approach  introduced  here.  This  method  is  a  promising  avenue 
for  further  research.  The  Zi's  can  be  sampled  in  a  designed,  dependent  manner  that  refects 
obvious  features  of  the  data,  and  this  may  well  have  the  potential  to  lead  to  improved 
methods.  For  example,  the  initial  groups  can  be  based  upon  the  fact  that  particular  pairs  of 
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points  are  much  more  likely  to  be  in  the  same  group  than  other  pairs.  Furthermore,  in  its 
more  complex  forms,  Z-distance  sampling  allows  tuning  of  its  Dirichlet-multinomial  sampling 
distributions  to  the  data.  The  simplest  form  of  Z-distance  sampling,  the  Uniform  Distance 
method  with  R  =  G,  appears  to  be  adequate  for  small  samples  and  moderately  peaked 
likelihoods.  However,  it  uses  little  information  from  the  data  and  may  be  too  inefhcient  for 
many  problems.  For  our  large-sample  flat  likelihood  test  case  (Data  Set  5),  was  improved 
markedly  by  using  more  data  information  simply  by  choosing  R  >  G  initial  groups  via 
a  non-parsimonious  initial  clustering  step.  This  method  may  be  particularly  useful  when 
models  with  G  greater  than  the  true  G  will  be  htted,  resulting  in  flat  likelihood  surfaces 
with  many  minor  modes.  A  problem  for  future  research  is  to  hud  a  method  for  choosing  the 
number  of  initial  groups  to  optimize  the  MSE,  say. 

For  peaked  likelihoods,  ZD-2  sampling  performed  as  well  as  the  DMIS  method,  using  (10) 
to  tune  the  importance  sampling  distribution  parameters.  The  Z-distance  sampling  method 
proposed  here  allows  very  flexible  sampling  schemes.  Even  more  flexible  sampling  schemes 
could  be  obtained  by  allowing  the  groupings  to  vary  with  k  and/or  allowing  the  assignments 
in  the  j  +  1G  +  2,  ...R  groups  to  be  conditional  on  the  assignments  in  the  1, . . . ,  j  groups. 

One  place  where  difhculties  might  arise  with  our  variance-optimization  method  is  in 
obtaining  adequate  initial  estimates  of  I  and  Zm-  However,  the  need  for  some  sort  of  infor¬ 
mation  about  the  integrand  is  a  general  requirement  for  optimizing  an  importance  sampling 
technique,  not  a  shortcoming  of  our  method  in  particular.  In  our  examples,  we  had  no 
trouble  hnding  a  suitable  initial  /  by  taking  5  =  .5  in  the  2-component  DMIS  method. 
Nevertheless,  it  is  conceivable  that  the  2-component  DMIS  method  might  fail  completely 
for  some  data/model  combinations.  A  simple  Laplace-based  method  such  as  BIG,  even  if 
inadequate  as  a  hnal  value,  may  be  good  enough  to  be  used  as  an  initial  value. 

The  choice  of  a  suitable  importance  sampling  method  (with  a  suitable  number  of  compo¬ 
nents  when  the  DMIS  method  is  used)  may  be  easier  if  the  data  analyst  has  some  knowledge 
of  the  likelihood  surface.  For  complex  problems,  this  can  be  obtained,  for  example,  by  exam¬ 
ining  the  surface  using  the  EM  algorithm  or  other  mode-hnding  routine  on  a  grid  of  starting 
points.  This  also  provides  a  search  for  Zm  by  applying  (12)  at  each  located  mode. 

No  single  importance  sampling  variant  can  be  expected  to  work  for  every  data  set,  and 
some  knowledge  of  the  integrand  is  required  in  order  to  choose  an  appropriate  variant. 
However,  the  small  set  of  variants  presented  here  seems  to  hold  some  promise  of  being  useful 
over  a  reasonably  wide  range  of  integrands.  The  methods  studied  here  are  also  simple,  and 
we  provide  theoretical  insights  into  constructing  good  importance  sampling  distributions 
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through  Theorems  1  and  2.  Like  Stephens  and  Donnelly  (2000),  we  conclude  that  importance 
sampling  deserves  further  development  as  a  viable,  potentially  simpler  alternative  to  MCMC 
methods  for  some  problems. 

APPENDIX 

A.  Proof  of  Theorem  1  For  this  proof,  we  redehne  r  =  (tti,  . . . ,  ttg-i,  0')^  that  the 
information  matrices  can  be  assumed  to  be  positive  dehnite.  Assume  that  the  following 
conditions  hold: 

(1)  r  is  r-dimensional  with  prior  density  p(r)  that  is  continuous  at  Tq,  and  p(ro)  >  0. 

(2)  fn  A  To  where  tq  is  a  point  in  [0, 1]*^“^  x  0. 

(3)  There  exists  a  function  M(X;  tq)  with  the  property  that  given  any  e  >  0 
^log/x,z{Xi,i(Xj,ro)|r)  -  ^  log/x,z{Xi,  5(Xj,  ro)|ro)  <  M(Xj;ro) 

with  £'[M(Xj;  tq)]  <  e  whenever  |r  —  ro|  <  5  for  some  5. 

(4)  The  functions  Zij(X.i]  r)  are  simultaneously  continuous  at  tq  with  probability  going  to  1; 

(5)  There  exists  a  function  M(Xi]  tq)  with  the  property  that  given  any  e  >  0 

^log/x{Xj|r)  -  ^log/x{Xj|ro)  <  M(Xi]To) 
with  E[M(Xi]  To)]  <  e  whenever  \t  —  To\  <5  for  some  5. 


We  suppress  some  subscripts  for  ease  of  notation  in  the  proof.  Partition  the  ratio  of 
interest  as  follows: 

P„(I(XW)|X<">)  ^  /p(X<">,l(XW)|T)p(T)(iT  p(XW|t  =  f) 

p„(-S(X(”))|X(”),  r  =  f„)  p{X(”),  i(X(”))|r  =  f)  /p(X(”)  |r)p(r)(ir 


p„(-S(X(”))|X(”),  r  =  f„)  p{X(”),  i(X(”))|r  =  f)  /p(X(”)  |r)p(r)(ir 

Di  D2 

The  proof  is  similar  to  that  of  Walker  (1969)  to  show  that 

Walker’s  9  and  /(Xj|0)  are  replaced  by  r  and  /x,z(Xj,  ij(Xj,  f)|r),  respectively,  and  a  few 
extra  steps  are  needed  to  show  that  .2j(Xj,f)  can  be  replaced  by  ij(Xj,ro)  in  determining 
the  limit.  Briefly,  the  integrand  in  Ni  is  expanded  about  f  to  obtain 

W  =  yexp(log(p(X(”),i(X(”))|r)p(r)))dr  (27) 

=  p(X(”),  5(X(”))|r  =  f)p(f)  J exp{^(r  -  f)'C'„(f;  f)(r  -  f)[l  +  Rin  +  R2n\}dT, 
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where  t*  G  (f,r), 


Cn{f]T*) 

Rln 

^2n 


|^log!,(X("),i(X("))|T)|,.. 
log;)(r)  -  \ogp{f) 

(r  -  fy[Cn{f,  T*)  -  Cn{f,  f)](r  -  f) 
(r  -  f)'C'„(f,f){r  -  f) 


Here  we  have  used  the  fact  that 


=  0  at  f. 

OT 


The  assumptions  ensure  that  Rin  and  i?2n  converge  in  probability  to  zero,  uniformly  over 
r  in  a  neighborhood  of  tq,  and  that  the  integrand  is  negligible  outside  this  neighbor¬ 
hood.  Also,  \Cn{f,f)\/n'^  A-  |J(ro)|  by  continuous  mapping  and  (26)  follows.  Similarly, 
D2/ A  (27r)''/^|  J(ro)p/^,  was  shown  by  Walker  (1969),  which  establishes  the 
theorem. 

B.  Proof  of  Theorem  2 

Let  j  index  the  G”  values  of  Z,  and  let  hsj  denote  the  probability  of  Zj  under  a  generic 
importance  sampling  distribution  h{-)  that  depends  on  5  in  some  manner.  Also  let  kj  be  the 
number  of  times  Zj  is  drawn  in  a  sample  of  size  K.  Note  that  the  variance  of  /  is  given  by 


var(/)  =  ui(5)  =  var(^  ^ 

^  1=1 


L{ii.\Zj)p{Zj)k^ 


h 


Sj 


) 


1  „  L(x|Z,)V(Z,)^(l  -  hi,) 
K  y  htj 


i:i(x|Z,)p(Z,.)L(x|Z.)p(Z.)), 


(28) 


since  (/ci, . . . ,  kj)  is  multinomial  {K,  {hsi, . . . ,  hsc^))-  Note  that  omitting  the  dependence  on 
5  and  minimizing  (28)  with  respect  to  each  hj  results  in 


hj  =  L(xiZj)p(Zj)/^L(x|Zfe)p(Zfe)  =p(Zj|x), 

k 

the  posterior  probability  of  Zj.  This  shows  that  the  unknown  p(Zjx)  is  the  optimal  impor¬ 
tance  sampling  distribution  in  terms  of  minimizing  var(/). 

Returning  to  h’s  of  known  form  depending  on  5,  a  reasonable  simplifying  assumption  at 
this  point  is  to  lump  Z’s  with  the  same  or  nearly  the  same  values  of  L(x|Z)p(Z)  into  a  single 
equivalence  class  and  take  h{Z)  to  be  the  same  for  all  Z’s  in  an  equivalence  class.  Because 
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there  is  a  class  of  points  with  negligible  mass  for  L(x|Z)p(Z),  we  found  it  reasonable  to  ap¬ 
proximate  the  minimization  problem  by  assuming  just  two  equivalence  classes  corresponding 
to  the  “non-negligible”  and  “negligible”  classes  of  points,  respectively.  Suppose  there  are 
Cl  elements  in  the  “non-negligible”  class  (class  1)  and  C2  elements  in  the  “negligible”  class 
(class  2)  ,  with  /i  and  /2  being  the  respective  values  of  L(x|Z)p(Z)  and  hsi  and  hs2  being 
the  respective  values  of  h.  Then, 


ni(5) 


ci/i  (1  -  hsi)  ^  C2/|(l  -  hs2) 
hsi  hs2 


-I-  constant. 


(29) 


Now,  minimizing  (29)  with  respect  to  subject  to  the  constraint  Cihsi  -|-  C2hs2  =  1  gives 


/i  L{y.\ZM)p{'^M)  ^ 

-  - - 1 - 


(30) 


The  last  equality  follows  from  the  fact  that  Zm  must  be  in  the  non-negligible  class  of  points. 
Note  that  (30)  does  not  depend  on  the  actual  form  of  h{-).  When  h{-)  has  the  form  of  (8), 
(10)  follows  from  (30). 

One  of  the  goals  of  this  research  is  to  hnd  an  importance  sampling  method  whose  precision 
can  be  accurately  monitored  using  the  sample  variance  of  the  /^’s.  Hence,  it  is  of  interest 
to  choose  h  to  minimize  or  nearly  minimize  V2{S)  =  var(vaf(/)).  In  this  subsection,  we  show 
that  hsi  in  (30)  also  minimizes  V2{S)  under  the  conditions  above.  Specihcallyy, 


1 


K 


var(var(/))  =  var(— ^(4  -  7)^) 

k=l 


hi  -  hi  ‘^{hi  -  2/^i)  ^hi-  ^hl 


K 


(31) 

(32) 


where  pj  represents  the  central  moment  of  the  distribution  producing  the  IID 
(Cramer,  1946).  Let  r/fe  =  1  if  is  in  class  1  and  let  pk  =  0  ii  Z^  is  in  class  2.  Then 


hk  —  -^(^1  ^A:) 


pC^k) 

h{Zk) 


Hence, 


/l  (1  'I 

7— 2/a  +  7— (1  -  2/a) 
tl61  <^52 

(33) 

(  fi  /z  \ 

\hsi  hs2)^''  hs2 

(34) 

Mi{hsi)yk  +  M2{hsi). 

(35) 

=  Mfpq{l  -  3pq), 

(36) 
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where  p  =  Cihsi  =  E[yk]  and  q  =  1  —  p.  This  gives 

var(var(/))  oc  {K  —  l)Mfpq  —  2{2K  —  3)Mfp^q^.  (37) 

It  can  be  verihed  directly  that  the  hsi  in  (30)  is  a  zero  of  the  derivative  of  (37)  and  that  it 
represents  a  minimum. 


C.  Z  distance  Sampling  Using  a  Dirichlet-Mnltinomial  distribntion 

The  Dirichlet-Mnltinomial  distribution  is  a  compound  distribution  that  describes  the  marginal 
distribution  of  X  when  X|p  is  Multinomial(n,p  =  (pi,  •  •  •  ,Pg)):  and  p  has  a  Dirichlet  distri¬ 
bution  whose  joint  density  is  given  by 

CpT-^...P^g^-\  (38) 


where  pi  >  0,  Y,Pi  =  1,  <5  =  and  o;,  >  0.  Then  the  marginal  mean  of  X  is  np,  where 
the  mean  parameter  p  =  (cti, . . . ,  etc) /'S';  and  the  dispersion  parameter  for  X  is  w  =  1/S. 
The  probability  mass  function  for  X  is 


Prob[X  =  (xi, . .  .,xg)] 


n\r{uj  +  PiUJ  ^) 

n*  xd  n*  T{piUj-^)T{n  +  w-i)  ■ 


(39) 


To  facilitate  the  Z-distance  sampling  described  in  section  2,  it  is  useful  to  note  that  the 
number  of  observations  falling  into  group  j  is  beta-binomial  with  probability  mass  function 

n!r(cc;“^)r(c -|-  pico~^)r{n  —  c-|-  (1  —  pi)co~^) 


Prob[Xj  =  c 


c!(n  —  c)!r(/riCc;“^)r((l  —  pi)co~^)T{n  -|-  co~^) 
n!r(iS)r(c  -|-  Q;i)r(n  —  c  -|-  iS  —  aai) 
cl{n  —  c)!r(Q;i)r(iS  —  Q;i)r(n  -|-  S) 


(40) 

(41) 


Given  Xi  =  c,  the  conditional  distribution  of  (X2, . . .  ,Xg|Xi  =  c)  is  Dirichlet-Mnltinomial 
with  mean  parameter  p'  =  (/r2, . . . ,  Pg)/  Pj-,  and  dispersion  parameter  uj'  =  uj/  Pj- 
Hence,  the  distribution  of  the  n  observations  into  the  G  groups  can  be  done  as  a  sequence  of 
beta-binomial  samples.  This  is  particularly  easy  when  p  =  (1/G, . . . ,  1/G)  and  a;  =  1/G  for 
the  Uniform  Distance  method.  The  Uniform  Distance  method  is  a  variant  of  our  Z-distance 
method  that  incorporates  a  limited  amount  of  information  from  p(Z|x,  r  =  f).  When  neces¬ 
sary,  improved  performance  might  be  achieved  by  taking  pj  =  (1  —  q;)(0,  . . . ,  1, . . . ,  0)'  -|-  ap, 
where  the  hrst  component  vector  has  a  1  in  the  position,  and  the  element  of  the 
second  component  vector  is  given  by 


S^egroup  j  Prob[Z;fc - l|x,  r  —  f 

S/Ggroup  j  Prob[Z;fc  ==  l|x,  r  =  f 
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CO  could  be  fixed  at  l/G  again,  and  a  would  be  chosen  to  satisfy  equation  (11).  With  this 
method,  observations  that  are  in  group  j  in  the  initial  grouping  based  on  are  most  likely 
to  be  redistributed  into  groups  close  to  group],  whereas  the  Uniform  Distance  method  redis¬ 
tributes  the  observations  uniformly  into  the  groups.  Another  adaptive  variant  of  Z-distance 
sampling,  ZD-2,  is  discussed  in  the  text. 
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